*: reduce getGCState freq #10817
Conversation
No actionable comments were generated in the recent review. 🎉
📒 Files selected for processing (7)
💤 Files with no reviewable changes (4)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Adds a fetch-strategy enum and updates `PDClientHelper::getGCSafePointWithRetry` to accept `GCSafepointFetchStrategy`, changes several call sites to use `CacheOnly` where appropriate, and adds tests verifying cache-only and cache-with-refresh behaviors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as "Caller\n(Storage / Debug / DeltaMerge)"
    participant Helper as "PDClientHelper"
    participant Cache as "Local Cache\n(ks_gc_sp_map)"
    participant PD as "PD Server\n(pd_client)"
    Caller->>Helper: getGCSafePointWithRetry(keyspace, fetch_strategy)
    Helper->>Cache: lookup cached safepoint for keyspace
    alt fetch_strategy == CacheOnly
        Cache-->>Helper: cached value or miss
        Helper-->>Caller: return cached value (or 0 on miss)
    else fetch_strategy == UpdateCacheIfNeeded
        Cache-->>Helper: cached value (may be expired)
        alt cached valid within interval
            Helper-->>Caller: return cached value
        else cached absent/expired
            Helper->>PD: getGCState(...) (with retry/backoff)
            PD-->>Helper: gc_safe_point
            Helper->>Cache: update cache if gc_safe_point != 0
            Helper-->>Caller: return gc_safe_point
        end
    end
```
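The two strategies in the diagram can be condensed into a minimal, self-contained C++ sketch. The enum name matches the PR; `SafepointCache`, `getGCSafePoint`, and `fetch_from_pd` are illustrative stand-ins, not the actual TiFlash declarations (the real cache also tracks update timestamps for expiry):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Illustrative stand-ins for PDClientHelper's cache and the PD client.
enum class GCSafepointFetchStrategy
{
    CacheOnly,
    UpdateCacheIfNeeded,
};

struct SafepointCache
{
    std::mutex mu;
    std::unordered_map<uint32_t, uint64_t> map; // keyspace_id -> gc safepoint

    uint64_t get(uint32_t ks)
    {
        std::lock_guard<std::mutex> lock(mu);
        auto it = map.find(ks);
        return it == map.end() ? 0 : it->second; // 0 signals a cache miss
    }
    void put(uint32_t ks, uint64_t sp)
    {
        std::lock_guard<std::mutex> lock(mu);
        map[ks] = sp;
    }
};

// fetch_from_pd stands in for the retried getGCState RPC.
uint64_t getGCSafePoint(
    SafepointCache & cache,
    uint32_t keyspace_id,
    GCSafepointFetchStrategy strategy,
    uint64_t (*fetch_from_pd)(uint32_t))
{
    if (strategy == GCSafepointFetchStrategy::CacheOnly)
        return cache.get(keyspace_id); // never contacts PD; 0 on miss

    uint64_t sp = fetch_from_pd(keyspace_id);
    if (sp != 0)
        cache.put(keyspace_id, sp); // only cache meaningful values
    return sp;
}
```

Note that a `CacheOnly` lookup on a cold cache returns 0, which is exactly the cold-start window discussed in the review comments.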
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDeltaMerge.cpp (1)
719-745: ⚠️ Potential issue | 🟠 Major
`checkStartTs` becomes a no-op on cache miss — this is intentional but needs explicit documentation.
`checkStartTs` is the safety net that rejects queries whose `start_ts` is below the GC safepoint. With `GCSafepointFetchStrategy::CacheOnly`, `getGCSafePointWithRetry` returns `0` on cache miss (confirmed in PDTiKVClient.h:200-208), making the comparison `start_ts < 0` always false — the check is silently skipped.
The design is intentional: background/non-query callers (SchemaSyncService with `ignore_cache=true`, PrehandleSnapshot, DeltaMergeStore_InternalBg) populate the cache via PD, while query paths consume only the cached value to avoid per-query PD traffic. This is validated by explicit test coverage (`CacheOnlyReadPathDoesNotFetchFromPD`).
However, a startup window exists: if a query arrives before any background path has populated the cache for a given keyspace (fresh TiFlash process, or a new keyspace that none of the background tasks have touched yet), the safety check is bypassed entirely. This trade-off between startup safety and steady-state performance should be:
- Explicitly documented in the commit message or PR description — the current log message suggests the default behavior is preserved, which is misleading.
- Confirmed acceptable by the authors — either the startup window is guaranteed short in practice (e.g., schema sync always runs before the first query), or the risk is acceptable by design.
Consider adding a note in the code or PR explaining why this trade-off is acceptable at TiFlash startup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/StorageDeltaMerge.cpp` around lines 719 - 745, The checkStartTs function uses PDClientHelper::getGCSafePointWithRetry(..., GCSafepointFetchStrategy::CacheOnly) (called from checkStartTs) which returns 0 on cache miss, effectively making the start_ts < safe_point check a no-op during cold starts; add an explicit in-code comment above checkStartTs (or adjacent to the PDClientHelper call) stating that CacheOnly yields 0 on miss, that background callers (SchemaSyncService, PrehandleSnapshot, DeltaMergeStore_InternalBg) are responsible for populating the cache, and document the startup window trade-off and why it is acceptable (or link to the PR/issue) so reviewers/readers are not misled by the current behavior/logging.
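A hedged sketch of the in-code note the comment above asks for. The signature is simplified — the real `checkStartTs` in TiFlash takes more parameters — and only the documented skip-on-cold-cache behavior is the point:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Simplified sketch of checkStartTs with the requested documentation.
// NOTE (assumption made explicit): with GCSafepointFetchStrategy::CacheOnly,
// safe_point is 0 until a background task (SchemaSyncService, PrehandleSnapshot,
// DeltaMergeStore_InternalBg) populates the cache, so the comparison below is
// trivially false during that startup window and the check is skipped.
void checkStartTs(uint64_t start_ts, uint64_t safe_point)
{
    if (start_ts < safe_point)
        throw std::runtime_error(
            "query start_ts " + std::to_string(start_ts)
            + " is older than gc safe point " + std::to_string(safe_point));
}
```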
🧹 Nitpick comments (3)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)
1268: Wall-clock sleep adds mild flakiness.
`sleep_for(2s)` combined with `safe_point_update_interval_seconds = 1` is fine in practice, but makes this test timing-dependent and slow. If you ever want to make it deterministic, the cache uses `steady_clock` inside `getGCSafepointIfValid`, so a custom clock injection or exposing an "expire now" hook would eliminate the sleep. Not required for this PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1268, The test uses a wall-clock sleep (std::this_thread::sleep_for(std::chrono::seconds(2))) which makes it timing-dependent and slow; replace this with a deterministic approach by injecting a test clock or adding an "expire now" hook so the cache/GC safepoint logic can be advanced without sleeping — target the code paths around getGCSafepointIfValid and the safe_point_update_interval_seconds behavior (use a steady_clock-injectable implementation or call an exposed method to force expiry) so the test can trigger the same cache refresh immediately and remove the sleep.
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
198-225: Consider whether the CacheOnly path should emit `tiflash_gc_safepoint_backoff_count{type=success}`.
In the `CacheOnly` branch no PD request is ever made, so calling `observe_backoff_count(true)` (with `backoff_count == 0`) inflates the `type_success` histogram with samples that do not correspond to an actual PD fetch. Previously every observation in that metric reflected a real PD interaction; after this change the vast majority of observations will come from cache-only lookups on the hot read path and the metric will mostly report zeros. If the intent of the metric is to track PD-fetch backoff, consider skipping the observation on the `CacheOnly` cache-hit/miss path (and also skipping it on the existing fast-path cache hit).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 198 - 225, The CacheOnly branch and the fast-path cached-hit branch currently call observe_backoff_count(true) even though no PD request occurs; remove those calls so the tiflash_gc_safepoint_backoff_count{type=success} metric is only observed when an actual PD fetch/backoff happens. Specifically, in PDTiKVClient.h inside the code paths handling GCSafepointFetchStrategy::CacheOnly and the getGCSafepointIfValid(cache-hit) branch, eliminate the observe_backoff_count(true) calls and ensure observe_backoff_count is only invoked in the code path(s) that perform a real PD fetch (the fallback PD-fetch logic). Use the existing symbols ks_gc_sp_map.getGCSafepoint, ks_gc_sp_map.getGCSafepointIfValid, and observe_backoff_count to locate and adjust the calls.
dbms/src/Storages/StorageDeltaMerge.cpp (1)
915: `checkStartTs` is now invoked three times per read; with CacheOnly this is effectively identical to a single call.
Previously each `checkStartTs` call could trigger a PD fetch when the cache was stale, so invoking it pre-read / post-read / post-snapshot gave each call an independent chance to pick up a freshly advanced safepoint. With `CacheOnly` all three calls read the same cached value (unless a background path races in between), so the "ensure after read" invariant the comments describe no longer adds meaningful coverage. Worth a short note in the PR or code comments that the post-read checks are kept as a defense-in-depth against a future background refresh between the two calls, rather than active safety.
Also applies to: 1011, 1057
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/StorageDeltaMerge.cpp` at line 915, Multiple calls to checkStartTs(mvcc_query_info.start_ts, context, query_info.req_id, keyspace_id) now read the same cached safepoint under CacheOnly, so the post-read invocations no longer increase coverage; update the code comment near the checkStartTs calls (the pre-read/post-read/post-snapshot invocations) to state explicitly that with CacheOnly the checks observe the same cached value and that the extra calls are retained as defense-in-depth only (to catch a rare background refresh/race), rather than for active additional safety, so future readers understand why we keep the redundant calls.
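The backoff-metric concern raised above for PDTiKVClient.h boils down to where the observation happens. In this sketch `BackoffMetric` and the function shape are hypothetical; only the placement of the observation mirrors the suggestion:

```cpp
#include <cassert>
#include <cstdint>

enum class GCSafepointFetchStrategy { CacheOnly, UpdateCacheIfNeeded };

// Hypothetical stand-in for the tiflash_gc_safepoint_backoff_count histogram.
struct BackoffMetric
{
    int success_samples = 0;
    void observeSuccess(int /*backoff_count*/) { ++success_samples; }
};

uint64_t getSafePoint(
    GCSafepointFetchStrategy strategy,
    bool cache_hit,
    uint64_t cached,
    uint64_t fetched_from_pd,
    BackoffMetric & metric)
{
    // Cache-only path and fast-path cache hit: no PD request, so no sample.
    if (strategy == GCSafepointFetchStrategy::CacheOnly)
        return cache_hit ? cached : 0;
    if (cache_hit)
        return cached;
    // Only a real PD fetch contributes a sample to the backoff histogram.
    metric.observeSuccess(/*backoff_count=*/0);
    return fetched_from_pd;
}
```

With this placement, hot-path cache lookups no longer dilute the histogram with zero-valued samples.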
Actionable comments posted: 1
🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)
1279: Optional: reduce fixed wait time in expiry test.
`2s` works, but for a `1s` validity window, a slightly-above-1s wait (e.g. `1100ms`) keeps semantics and trims test runtime.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1279, Replace the long fixed 2s sleep in the expiry test's std::this_thread::sleep_for call with a slightly-above-1s duration (e.g., 1100ms) so the test still exceeds the 1s validity window but runs faster; locate the std::this_thread::sleep_for(...) invocation in the expiry test and change the duration to std::chrono::milliseconds(1100).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp`:
- Around line 1223-1261: The test leaves
PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) as a trailing cleanup
that won't run if an ASSERT_* aborts the test; make cleanup RAII-safe by
creating a small scope guard (e.g., KeyspaceGCSafepointGuard or a
std::unique_ptr with a custom deleter) that calls
PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) in its destructor and
instantiate it immediately after defining keyspace_id; apply the same RAII guard
to the other duplicate cleanup site around lines covered by the second block
(the case around 1265-1304) so the safepoint is always removed even on assertion
failures.
♻️ Duplicate comments (1)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)
1226-1263: ⚠️ Potential issue | 🟡 Minor
Make cache cleanup assertion-safe with RAII.
If any `ASSERT_*` aborts early, the trailing cleanup won't run and cached safepoint state can leak into later tests.
Based on learnings: Ensure `TearDown()` is called or cleanup helpers are used to avoid side effects on other tests.
Proposed fix:

```diff
 TEST(PDClientHelperTest, CacheOnlyReadPathDoesNotFetchFromPD)
 {
     constexpr KeyspaceID keyspace_id = 9527;
     PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
+    SCOPE_EXIT({ PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id); });
@@
-    PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
 }

 TEST(PDClientHelperTest, CacheOnlyReadPathCanReturnExpiredCache)
 {
     constexpr KeyspaceID keyspace_id = 9528;
     PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
+    SCOPE_EXIT({ PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id); });
@@
-    PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
 }
```

Also applies to: 1268-1305
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` around lines 1226 - 1263, Wrap the PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) cleanup in an RAII/TearDown-safe mechanism so it always runs even if ASSERT_* aborts; specifically, create a scoped cleanup (or move the test into a test fixture and implement TearDown) that calls PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) and use it around the calls to PDClientHelper::getGCSafePointWithRetry and related assertions (references: PDClientHelper::removeKeyspaceGCSafepoint, PDClientHelper::getGCSafePointWithRetry, and the failing test in gtest_new_kvstore.cpp); apply the same RAII/fixture cleanup to the other similar block noted (lines ~1268-1305).
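For reference, a minimal scope guard in the spirit of the `SCOPE_EXIT` fix suggested above. TiFlash ships its own `SCOPE_EXIT` macro; this standalone version only illustrates why the cleanup survives an early `ASSERT_*` return (a failed non-fatal gtest assertion returns from the test body, which still unwinds the scope and runs destructors):

```cpp
#include <cassert>
#include <utility>

// Destructor-based cleanup: runs on any scope exit, including the early
// `return` that a failed gtest ASSERT_* performs.
template <typename F>
class ScopeGuard
{
public:
    explicit ScopeGuard(F f) : func(std::move(f)) {}
    ~ScopeGuard() { func(); }
    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard & operator=(const ScopeGuard &) = delete;

private:
    F func;
};

template <typename F>
ScopeGuard<F> makeScopeGuard(F f)
{
    return ScopeGuard<F>(std::move(f)); // relies on C++17 guaranteed elision
}
```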
🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)
1281: Prefer controlled staleness over wall-clock sleep.
`std::this_thread::sleep_for` here adds avoidable test latency and can introduce CI timing flakiness; consider a failpoint/sync-point based trigger for cache-expiry simulation.
Based on learnings: storage engine tests should rely on failpoints (and SyncPointCtl) to simulate timing/race conditions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1281, The test currently uses std::this_thread::sleep_for to simulate cache expiry, which adds latency and flakiness; replace that wall-clock sleep by driving a deterministic test synchronisation: remove std::this_thread::sleep_for and instead trigger the cache-expiry path via a failpoint or SyncPointCtl (e.g., enable/trigger a named failpoint or SyncPoint and wait for its notification) so the test explicitly advances the component under test (cache invalidation/expiry) and blocks on the sync-point rather than sleeping; look for occurrences of std::this_thread::sleep_for in gtest_new_kvstore.cpp and wire the test to use the existing FailPoint/SyncPointCtl APIs to simulate the timing event.
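Besides failpoints, the clock-injection idea mentioned earlier is another deterministic alternative to sleeping. This sketch assumes the cache entry's notion of "now" can be injected; the real `getGCSafepointIfValid` reads `steady_clock` directly, and the names here are hypothetical:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using TimePoint = std::chrono::steady_clock::time_point;

// Cache entry whose validity is judged against an injected clock function,
// so a test can "advance time" instead of sleeping.
class SafepointEntry
{
public:
    explicit SafepointEntry(TimePoint (*now)()) : now_func(now) {}

    void update(uint64_t v)
    {
        value = v;
        updated_at = now_func();
        has_value = true;
    }

    // Valid only within interval_seconds of the last update.
    bool isValid(int64_t interval_seconds) const
    {
        return has_value
            && now_func() - updated_at < std::chrono::seconds(interval_seconds);
    }

private:
    TimePoint (*now_func)();
    TimePoint updated_at{};
    uint64_t value = 0;
    bool has_value = false;
};
```

Production code would pass a thin wrapper around `steady_clock::now`; a test passes a fake clock it can advance by two seconds in zero wall time.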
```cpp
        throw pingcap::Exception("not implemented", pingcap::ErrorCodes::UnknownError);
    }

    bool isMock() override { return false; }
```

```cpp
    const auto min_interval = std::max(static_cast<Int64>(1), safe_point_update_interval_seconds);
    auto ks_gc_info = ks_gc_sp_map.getGCSafepointIfValid(keyspace_id, min_interval);
    if (ks_gc_info.has_value())
    if (fetch_strategy == GCSafepointFetchStrategy::CacheOnly)
```
It seems that if the input arguments are `ignore_cache = false` and `fetch_strategy = CacheOnly`, the `CacheOnly` will be ignored.
Maybe we should remove the `ignore_cache` param entirely?
@JaySon-Huang I think I can change both remaining occurrences to false, then rerun the tests to see if there are any errors.
```cpp
namespace
{

class CountingPDClient : public pingcap::pd::IClient
```
Can we use failpoint instead of constructing a whole CountingPDClient just for testing?
I let AI generate one version for me, and I think it is not very good.
Because we need to add hooks including:
- force_pd_get_gc_state_safe_point
- force_pd_get_gc_state_resp_error
- force_pd_get_gc_state_throw
- force_pd_get_gc_state_count
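For comparison, the counting-client approach under discussion reduces to something like this hypothetical stand-in. The real `CountingPDClient` implements the full `pingcap::pd::IClient` interface; this sketch only shows the counting and fault-injection behavior that the `force_pd_get_gc_state_*` hooks above would otherwise provide:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical test double: counts getGCState-style calls and can inject
// errors, covering roughly what the failpoint hooks would do.
class CountingGCStateSource
{
public:
    explicit CountingGCStateSource(uint64_t safe_point) : safe_point_(safe_point) {}

    uint64_t getGCSafePoint()
    {
        calls_.fetch_add(1, std::memory_order_relaxed);
        if (throw_on_call_)
            throw std::runtime_error("injected PD error");
        return safe_point_;
    }

    void setThrow(bool v) { throw_on_call_ = v; }
    int calls() const { return calls_.load(std::memory_order_relaxed); }

private:
    uint64_t safe_point_;
    bool throw_on_call_ = false;
    std::atomic<int> calls_{0};
};
```

A test asserts on `calls()` to verify that the cache-only read path never reaches PD, and flips `setThrow(true)` to exercise the retry/error branches.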
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JaySon-Huang
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
[LGTM Timeline notifier]
Timeline:
What problem does this PR solve?
Issue Number: close #10818
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
Performance
Behavior
Tests